Introduction

About The Project

INN Hotels Group has a chain of hotels in Portugal, they are facing problems with the high number of booking cancellations. In the hotel industry, effective management of bookings is crucial for optimizing revenue and ensuring operational efficiency.

Our main objective in this project is to analyze the dataset to find which factors have a high influence on booking cancellations while answering some SMART questions. By examining customer attributes such as room preferences, lead time, meal plan choices, and past booking behavior, we will identify key patterns and trends. Additionally, we will also analyze other variables such as special requests, arrival month and market segments.

SMART Questions

1. Do customers who make special requests cancel their bookings less frequently than those who don’t?
2. Are guests with no previous cancellations more likely to avoid canceling their current booking?
3. What are the key factors that show the strongest correlation with booking cancellations?
4. How do hotel cancellation rates change across different seasons (spring, fall, winter, summer, holidays, low season), and which factors (e.g., lead time, room type/) correlate most strongly with cancellations during each season?
5. What are the differences in cancellation rates based on meal plan type, lead time, room type, and market segment, and what are the 95% confidence intervals for these rates?

About The Dataset

For this project, we have selected a hotel booking dataset containing over 36,000 records of bookings. [INNHotelsGroup dataset](https://www.kaggle.com/datasets/mariyamalshatta/inn-hotels-group) from Kaggle. This dataset offers a comprehensive overview of hotel booking patterns, making it ideal for our analysis on cancellations and no-shows.

hotel_data <- read.csv("../Dataset/INNHotelsGroup.csv")

This dataset contains 36275 observations of 19 variables. Out of these observations, 0 rows contain null values.

str(hotel_data)
## 'data.frame':    36275 obs. of  19 variables:
##  $ Booking_ID                          : chr  "INN00001" "INN00002" "INN00003" "INN00004" ...
##  $ no_of_adults                        : int  2 2 1 2 2 2 2 2 3 2 ...
##  $ no_of_children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ no_of_weekend_nights                : int  1 2 2 0 1 0 1 1 0 0 ...
##  $ no_of_week_nights                   : int  2 3 1 2 1 2 3 3 4 5 ...
##  $ type_of_meal_plan                   : chr  "Meal Plan 1" "Not Selected" "Meal Plan 1" "Meal Plan 1" ...
##  $ required_car_parking_space          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ room_type_reserved                  : chr  "Room_Type 1" "Room_Type 1" "Room_Type 1" "Room_Type 1" ...
##  $ lead_time                           : int  224 5 1 211 48 346 34 83 121 44 ...
##  $ arrival_year                        : int  2017 2018 2018 2018 2018 2018 2017 2018 2018 2018 ...
##  $ arrival_month                       : int  10 11 2 5 4 9 10 12 7 10 ...
##  $ arrival_date                        : int  2 6 28 20 11 13 15 26 6 18 ...
##  $ market_segment_type                 : chr  "Offline" "Online" "Online" "Online" ...
##  $ repeated_guest                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ no_of_previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ no_of_previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ avg_price_per_room                  : num  65 106.7 60 100 94.5 ...
##  $ no_of_special_requests              : int  0 1 0 0 0 1 1 1 1 3 ...
##  $ booking_status                      : chr  "Not_Canceled" "Not_Canceled" "Canceled" "Canceled" ...

Data Dictionary

  • Booking_ID: unique identifier of each booking
  • no_of_adults: Number of adults
  • no_of_children: Number of Children
  • no_of_weekend_nights: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
  • no_of_week_nights: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
  • type_of_meal_plan: Type of meal plan booked by the customer:
    • Not Selected – No meal plan selected
    • Meal Plan 1 – Breakfast
    • Meal Plan 2 – Half board (breakfast and one other meal)
    • Meal Plan 3 – Full board (breakfast, lunch, and dinner)
  • required_car_parking_space: Does the customer require a car parking space? (0 - No, 1- Yes)
  • room_type_reserved: Type of room reserved by the customer. The values are ciphered (encoded) by INN Hotels.
  • lead_time: Number of days between the date of booking and the arrival date
  • arrival_year: Year of arrival date
  • arrival_month: Month of arrival date
  • arrival_date: Date of the month
  • market_segment_type: Market segment designation.
  • repeated_guest: Is the customer a repeated guest? (0 - No, 1- Yes)
  • no_of_previous_cancellations: Number of previous bookings that were canceled by the customer prior to the current booking
  • no_of_previous_bookings_not_canceled: Number of previous bookings not canceled by the customer prior to the current booking
  • avg_price_per_room: Average price per day of the reservation; prices of the rooms are dynamic. (in euros)
  • no_of_special_requests: Total number of special requests made by the customer (e.g. high floor, view from the room, etc)
  • booking_status: Flag indicating if the booking was canceled or not.

Cleaning The Dataset

For our project, we are interested in majority of the variables but there are some irrelevant columns with respect to our objective and thus we will drop them.

hotel_data_clean <- hotel_data[, c("no_of_adults", "no_of_children", "no_of_weekend_nights", "no_of_week_nights", "type_of_meal_plan", "required_car_parking_space", "room_type_reserved", "lead_time", "arrival_year", "arrival_month",  "market_segment_type", "repeated_guest", "no_of_previous_cancellations", "no_of_special_requests", "booking_status", 
"avg_price_per_room")]

hotel_data_clean <- na.omit(hotel_data_clean)

After cleaning, we are left with 36275 observations of 16 variables.

write.csv(hotel_data_clean,"../Dataset/INNHotelsGroup_min.csv", row.names = F)

Exploratory Data Analysis (EDA)

1. Summary Statistics

hotel_data_clean$type_of_meal_plan <- as.factor(hotel_data_clean$type_of_meal_plan)
hotel_data_clean$room_type_reserved <- as.factor(hotel_data_clean$room_type_reserved)
hotel_data_clean$booking_status <- as.factor(hotel_data_clean$booking_status)
hotel_data_clean$market_segment_type <- as.factor(hotel_data_clean$market_segment_type)
hotel_data_clean$arrival_year <- as.factor(hotel_data_clean$arrival_year)
hotel_data_clean$arrival_month <- as.factor(hotel_data_clean$arrival_month)
summary(hotel_data_clean)
##   no_of_adults  no_of_children  no_of_weekend_nights no_of_week_nights
##  Min.   :0.00   Min.   : 0.00   Min.   :0.00         Min.   : 0.0     
##  1st Qu.:2.00   1st Qu.: 0.00   1st Qu.:0.00         1st Qu.: 1.0     
##  Median :2.00   Median : 0.00   Median :1.00         Median : 2.0     
##  Mean   :1.84   Mean   : 0.11   Mean   :0.81         Mean   : 2.2     
##  3rd Qu.:2.00   3rd Qu.: 0.00   3rd Qu.:2.00         3rd Qu.: 3.0     
##  Max.   :4.00   Max.   :10.00   Max.   :7.00         Max.   :17.0     
##                                                                       
##     type_of_meal_plan required_car_parking_space   room_type_reserved
##  Meal Plan 1 :27835   Min.   :0.000              Room_Type 1:28130   
##  Meal Plan 2 : 3305   1st Qu.:0.000              Room_Type 2:  692   
##  Meal Plan 3 :    5   Median :0.000              Room_Type 3:    7   
##  Not Selected: 5130   Mean   :0.031              Room_Type 4: 6057   
##                       3rd Qu.:0.000              Room_Type 5:  265   
##                       Max.   :1.000              Room_Type 6:  966   
##                                                  Room_Type 7:  158   
##    lead_time   arrival_year arrival_month      market_segment_type
##  Min.   :  0   2017: 6514   10     : 5317   Aviation     :  125   
##  1st Qu.: 17   2018:29761   9      : 4611   Complementary:  391   
##  Median : 57                8      : 3813   Corporate    : 2017   
##  Mean   : 85                6      : 3203   Offline      :10528   
##  3rd Qu.:126                12     : 3021   Online       :23214   
##  Max.   :443                11     : 2980                         
##                             (Other):13330                         
##  repeated_guest  no_of_previous_cancellations no_of_special_requests
##  Min.   :0.000   Min.   : 0.00                Min.   :0.00          
##  1st Qu.:0.000   1st Qu.: 0.00                1st Qu.:0.00          
##  Median :0.000   Median : 0.00                Median :0.00          
##  Mean   :0.026   Mean   : 0.02                Mean   :0.62          
##  3rd Qu.:0.000   3rd Qu.: 0.00                3rd Qu.:1.00          
##  Max.   :1.000   Max.   :13.00                Max.   :5.00          
##                                                                     
##       booking_status  avg_price_per_room
##  Canceled    :11885   Min.   :  0       
##  Not_Canceled:24390   1st Qu.: 80       
##                       Median : 99       
##                       Mean   :103       
##                       3rd Qu.:120       
##                       Max.   :540       
## 

Summary:
The average price per room in the dataset is 103.4 euros, with a median of 99 euros, but prices can reach up to 540 euros, indicating high-priced outliers. Some entries even show an average price of zero, possibly reflecting promotional deals. Guests typically stay for two weekday nights and one weekend night, with the average number of weekday nights being 2.2 and weekend nights 0.81. Stays can extend to as many as 17 weekday nights. Most bookings involve two adults, and many guests do not bring children. Lead times vary significantly, with an average of 85 days, a median of 57, and some bookings made up to 443 days in advance, suggesting a right-skewed distribution. The data also shows sparse previous cancellations, with an average of just 0.02, and a maximum of 58. Bookings are spread over 2017 and 2018, peaking in August. Additionally, most guests do not make special requests, as the median is zero, although some make up to five requests per booking.

2. Distribution of Booking Status

print(ggplot(hotel_data_clean, aes(x = booking_status)) +
  geom_bar(fill = "lightblue") +
  labs(title = "Booking Status Distribution", x = "Booking Status", y = "Count"))

Summary
The dataset contains 36,275 bookings, with 24,390 classified as “Not Canceled” and 11,885 as “Canceled”. Approximately two-thirds of the bookings were completed, while the remaining one-third were canceled.

3. Analysis of Meal Plan Type

print(ggplot(hotel_data_clean, aes(x = type_of_meal_plan, fill = booking_status)) +
  geom_bar(position = "fill") +
  labs(title = "Cancellation Rate by Meal Plan", x = "Meal Plan", y = "Proportion", fill = "Booking Status"))

Summary   Meal Plan 1 has the highest cancellation rate, where more than half of the bookings are canceled. Meal Plan 2 has a lower cancellation rate, with cancellations and non-cancellations being nearly equal. Meal Plan 3 shows the highest proportion of non-cancelled bookings compared to canceled ones, suggesting customers choosing this meal plan tend to cancel less frequently. For the “Not Selected” category, there is a relatively high cancellation rate, similar to Meal Plan 1.

4. Distribution of Room Types

print(ggplot(hotel_data_clean, aes(x = room_type_reserved, fill = booking_status)) +
  geom_bar(position = "fill") +
  labs(title = "Cancellation Rate by Room Type", x = "Room Type", y = "Proportion", fill = "Booking Status"))

Summary
Room Type 1 and Room Type 6 have the highest cancellation rates, with a large proportion of canceled bookings. Room Type 7 stands out as having the highest proportion of non-cancelled bookings, suggesting it may be a preferred or better-secured type of room. For other room types, the rates are fairly similar, with around 60-70% of bookings not canceled, except for Room Type 1 which has more cancellations.

5. Barplot: Number of Special Requests

print(ggplot(hotel_data_clean, aes(x = factor(no_of_special_requests))) +
  geom_bar(fill = "orange") +
  labs(title = "Number of Special Requests", x = "Number of Special Requests", y = "Count") +
  theme_minimal())

Summary   Most guests make no special requests, with a sharp decline as the number increases. A significant portion makes one request, while two or more requests are increasingly rare.

5. Plot: Market Segment vs Booking Status

contingency_table <- table(hotel_data_clean$market_segment_type, hotel_data_clean$booking_status)

plot_data <- as.data.frame(contingency_table)
colnames(plot_data) <- c("Market_Segment", "Booking_Status", "Count")

print(ggplot(plot_data, aes(x = Market_Segment, y = Count, fill = Booking_Status)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  labs(title = "Booking Status by Market Segment",
       x = "Market Segment",
       y = "Count",
       fill = "Booking Status") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)))

6. Histogram: Lead_time

ggplot(hotel_data_clean, aes(x = lead_time)) + 
  geom_histogram(binwidth = 30, fill = "blue", color = "black") + 
  theme_minimal() + 
  ggtitle("Distribution of Lead Time")

7. Boxplot: avg_price_per_room

ggplot(hotel_data_clean, aes(y = avg_price_per_room)) + 
  geom_boxplot(fill = "orange") + 
  theme_minimal() + 
  ggtitle("Boxplot of Average Price per Room")

Statistical Analysis

1. Do customers who make special requests cancel their bookings less frequently than those who don’t? : Chi-Square Test

Null Hypothesis (H₀): There is no significant association between special requests and booking status.
Alternative Hypothesis (H₁:)There is significant association between special requests and booking status.

hotel_data_clean <- hotel_data_clean %>%
  mutate(has_special_requests = ifelse(no_of_special_requests > 0, "Yes", "No"),
         is_cancelled = ifelse(booking_status == "Canceled", "Canceled", "Not_Canceled"))

special_requests_table <- table(hotel_data_clean$has_special_requests, hotel_data_clean$is_cancelled)

sum(special_requests_table) > 0
## [1] TRUE
special_requests_test <- chisq.test(special_requests_table)

special_requests_props <- prop.table(special_requests_table, margin = 1)

Contingency Table:

print(special_requests_table)
##      
##       Canceled Not_Canceled
##   No      8545        11232
##   Yes     3340        13158

Chi-Square Test Results:

print(special_requests_test)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  special_requests_table
## X-squared = 2152, df = 1, p-value <2e-16
print(ggplot(hotel_data_clean, aes(x = has_special_requests, fill = is_cancelled)) +
  geom_bar(position = "fill") +
  labs(title = "Cancellation Rates by Special Requests",
    x = "Has Special Requests", y = "Proportion") +
  theme_minimal())

The chi-square test result shows a very small p-value which is much less than the typical significance level of 0.05. Due to this we reject the null hypothesis (H₀) and accept the alternative hypothesis (H₁).

Bookings with special requests have a much lower cancellation rate (20.2%) compared to those without (43.2%). The data strongly suggests that customers who make special requests are more likely to follow through with their bookings and significantly less likely to cancel their bookings.

2. Are guests with no previous cancellations more likely to avoid canceling their current booking? : Chi-Square Test

Null Hypothesis (H₀): There is no association between having previous cancellations and the likelihood of canceling the current booking.
Alternative Hypothesis (H₁): Guests with no previous cancellations are less likely to cancel their current booking.

hotel_data_clean <- hotel_data_clean %>%
  mutate(has_previous_cancellations = ifelse(no_of_previous_cancellations > 0, "Yes", "No"))

previous_cancellations_table <- table(hotel_data_clean$has_previous_cancellations, hotel_data_clean$is_cancelled)
previous_cancellations_test <- chisq.test(previous_cancellations_table)

previous_cancellations_props <- prop.table(previous_cancellations_table, margin = 1)

Contingency Table:

print(previous_cancellations_table)
##      
##       Canceled Not_Canceled
##   No     11869        24068
##   Yes       16          322

Chi-Square Test Results:

print(previous_cancellations_test)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  previous_cancellations_table
## X-squared = 120, df = 1, p-value <2e-16
print(ggplot(hotel_data_clean, aes(x = has_previous_cancellations, fill = is_cancelled)) +
  geom_bar(position = "fill") +
  labs(title = "Cancellation Rates by Previous Cancellations",
       x = "Has Previous Cancellations", y = "Proportion") +
  theme_minimal())

Due to the p-value being below the significance threshold of 0.05, we reject the null hypothesis. This provides strong statistical evidence to support the alternative hypothesis that guests with previous cancellations are less likely to cancel their current booking.

Surprisingly, bookings from guests with previous cancellations have a much lower cancellation rate (4.73%) compared to those without previous cancellations (33.03%). This represents a substantial difference of about 28.3 percentage points.

3. How do hotel cancellation rates change across different seasons (spring, fall, winter, summer, holidays, low season), and which factors (e.g., lead time, room type/) correlate most strongly with cancellations during each season? : Correlation Analysis

# Convert booking_status to a binary variable
hotel_data_clean$booking_status_binary <- ifelse(hotel_data_clean$booking_status == "Canceled", 1, 0)

# Function to remove columns with zero variance
remove_zero_variance <- function(df) {
  df[, sapply(df, function(col) sd(col, na.rm = TRUE) != 0)]
}

# Remove columns with zero variance
data_filtered <- remove_zero_variance(select_if(hotel_data_clean, is.numeric))

# Calculate the correlation matrix
cor_data <- cor(data_filtered, use = "complete.obs")

# Visualize the correlation matrix
corrplot(cor_data, method = "color", addCoef.col = "black", 
         title = "Correlation Matrix for Entire Dataset", number.cex = 1, 
         tl.cex = 0.8, mar = c(1, 1, 2, 1))

The correlation analysis reveals key insights into factors linked to booking cancellations. Lead time has the strongest positive correlation (0.44), indicating that bookings made further in advance are more likely to be canceled. The number of special requests shows a negative correlation (-0.25), suggesting that personalized arrangements reduce cancellation likelihood. Average price per room has a weak positive correlation (0.14), implying that higher-priced bookings may be slightly more prone to cancellations. Repeated guests are slightly less likely to cancel (-0.11). Factors like the number of adults, children, and length of stay show minimal impact.

Seasonal correlation analysis was conducted to explore how booking behavior varies across spring, summer, fall, and winter. Understanding seasonal patterns in lead time, pricing, and cancellation rates can help customize strategies to reduce cancellations throughout the year. Overall, lead time and customer commitment stand out as key determinants of cancellations.

Seasonality Analysis:

# Define seasons based on arrival month
hotel_data_clean$season <- case_when(
  hotel_data_clean$arrival_month %in% c(3, 4, 5) ~ "Spring",
  hotel_data_clean$arrival_month %in% c(6, 7, 8) ~ "Summer",
  hotel_data_clean$arrival_month %in% c(9, 10, 11) ~ "Fall",
  hotel_data_clean$arrival_month %in% c(12, 1, 2) ~ "Winter",
  TRUE ~ "Unknown"
)
# Convert booking_status to a binary variable
hotel_data_clean$booking_status_binary <- ifelse(hotel_data_clean$booking_status == "Canceled", 1, 0)

Creating subsets for each season:

# Create subsets for each season
spring_data <- subset(hotel_data_clean, season == "Spring")
summer_data <- subset(hotel_data_clean, season == "Summer")
fall_data <- subset(hotel_data_clean, season == "Fall")
winter_data <- subset(hotel_data_clean, season == "Winter")

Remove columns with zero variance:

# Function to remove columns with zero variance
remove_zero_variance <- function(df) {
  df[, sapply(df, function(col) sd(col, na.rm = TRUE) != 0)]
}

# Remove zero variance columns for each season
spring_data_filtered <- remove_zero_variance(select_if(spring_data, is.numeric))
summer_data_filtered <- remove_zero_variance(select_if(summer_data, is.numeric))
fall_data_filtered <- remove_zero_variance(select_if(fall_data, is.numeric))
winter_data_filtered <- remove_zero_variance(select_if(winter_data, is.numeric))

Calculate correlation matrices:

# Calculate and visualize correlation for Spring
cor_spring <- cor(spring_data_filtered, use = "complete.obs")
corrplot(cor_spring, method = "color", addCoef.col = "black", 
         title = "Correlation Matrix for Spring", number.cex = 1, 
         tl.cex = 0.8, mar = c(1, 1, 2, 1))

# Calculate and visualize correlation for Summer
cor_summer <- cor(summer_data_filtered, use = "complete.obs")
corrplot(cor_summer, method = "color", addCoef.col = "black", 
         title = "Correlation Matrix for Summer", number.cex = 1, 
         tl.cex = 0.8, mar = c(1, 1, 2, 1))

# Calculate and visualize correlation for Fall
cor_fall <- cor(fall_data_filtered, use = "complete.obs")
corrplot(cor_fall, method = "color", addCoef.col = "black", 
         title = "Correlation Matrix for Fall", number.cex = 1, 
         tl.cex = 0.8, mar = c(1, 1, 2, 1))

# Calculate and visualize correlation for Winter
cor_winter <- cor(winter_data_filtered, use = "complete.obs")
corrplot(cor_winter, method = "color", addCoef.col = "black", 
         title = "Correlation Matrix for Winter", number.cex = 1, 
         tl.cex = 0.8, mar = c(1, 1, 2, 1))

Fall Correlation Matrix:
The strongest positive correlation with booking status (0.54) is lead time, suggesting that as the lead time increases, there is a higher chance of the booking being canceled. This makes sense, as bookings made far in advance may have a higher likelihood of being reconsidered or cancelled.
Number of special requests has a slight negative correlation with booking status (-0.23), indicating that bookings with more special requests tend to have a lower likelihood of cancellation.

Summer Correlation Matrix:
Lead time continues to show a strong positive correlation (0.43) with booking status, compared to other variables, meaning that longer lead times are associated with cancellations during the summer as well.
Number of special requests shows a moderate negative correlation (-0.30), reinforcing the idea that bookings with special requests are less likely to be cancelled.
Average price per room has a weak positive correlation (0.22), suggesting that higher-priced rooms might be slightly more likely to be cancelled in the summer.

Spring Correlation Matrix:
The positive correlation with lead time remains somewhat significant at 0.29, consistent with the previous seasons. Longer lead times are associated with a higher likelihood of cancellation.
Special requests have the strongest negative correlation (-0.36) in the spring, indicating that bookings with more special requests are much less likely to be canceled during this season.
Average price per room shows a small positive correlation (0.13) with booking status, indicating a slight tendency for higher-priced bookings to be cancelled.

Winter Correlation Matrix:
Lead time remains a positive trending factor, with a correlation of 0.24. Longer lead times during the winter continue to be associated with higher cancellation rates.
The number of special requests has a negative correlation (-0.14), suggesting that special requests still reduce the likelihood of cancellations in the winter, although the effect is less pronounced compared to other seasons.
Average price per room has a slightly stronger positive correlation (0.36), indicating that higher-priced rooms may have a higher chance of being cancelled during the winter season.

4. What are the differences in cancellation rates based on meal plan type, lead time, and market segment, and what are the 95% confidence intervals for these rates?

4.1 Does booking status vary upon Meal Plan Type?

Null Hypothesis (H₀): There is no significant difference in cancellation rates based on the type of meal plan.
Alternative Hypothesis (H₁): There is a significant difference in cancellation rates based on the type of meal plan.

hotel_data_clean$type_of_meal_plan <- as.factor(hotel_data_clean$type_of_meal_plan)
hotel_data_clean$booking_status <- as.factor(hotel_data_clean$booking_status)

contingency_table <- table(hotel_data_clean$type_of_meal_plan, hotel_data_clean$booking_status)

chi_test_result <- chisq.test(contingency_table)

print(chi_test_result)
## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 278, df = 3, p-value <2e-16

4.2 Does booking status vary upon lead time?

Null Hypothesis (H₀): There is no association between booking status and type of meal plan.
Alternative Hypothesis (H₁): There is an association between booking status and type of meal plan.

hotel_data_clean$canceled <- ifelse(hotel_data_clean$booking_status == "Canceled", 1, 0)

# Subsetting the data for customers who canceled
df_canceled <- subset(hotel_data_clean, canceled == 1)

# Subsetting the data for customers who did not cancel
df_not_canceled <- subset(hotel_data_clean, canceled == 0)

ttestlead_time <- t.test(df_canceled$lead_time, df_not_canceled$lead_time, 
                       alternative = "two.sided", conf.level = 0.95)
ttestlead_time
## 
##  Welch Two Sample t-test
## 
## data:  df_canceled$lead_time and df_not_canceled$lead_time
## t = 81, df = 16886, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  78.3 82.2
## sample estimates:
## mean of x mean of y 
##     139.2      58.9

4.3 Does booking status vary upon Average Room Price?

Null Hypothesis (H₀): There is no significant difference in cancellation rates based on the Average Room Price.
Alternative Hypothesis (H₁): There is a significant difference in cancellation rates based on the Average Room Price.

# Creating a binary variable for cancellation
hotel_data_clean$canceled <- ifelse(hotel_data_clean$booking_status == "Canceled", 1, 0)

# Subsetting the data for customers who canceled
df_canceled <- subset(hotel_data_clean, canceled == 1)

# Subsetting the data for customers who did not cancel
df_not_canceled <- subset(hotel_data_clean, canceled == 0)

# Performing a t-test to compare average price per room between those who canceled and those who didn't
ttest_avg_price_per_room <- t.test(
  df_canceled$avg_price_per_room,
  df_not_canceled$avg_price_per_room,
  alternative = "two.sided",
  conf.level = 0.95
)

# Function to interpret the t-test results
interpret_ttestavg_price_per_room <- function(ttest_result) {
  p_value <- ttest_result$p.value  # Extract the p-value from the t-test result
  
  # Print the test summary
  print(ttest_result)
  
  # Interpret based on the p-value threshold (commonly set at 0.05)
  if (p_value < 0.05) {
    cat("Conclusion:\n")
    cat("There is a statistically significant difference in the average price per room between customers who canceled and those who did not.\n")
    cat("The p-value is less than 0.05, meaning the observed difference in means is unlikely to have occurred by chance.\n")
  } else {
    cat("Conclusion:\n")
    cat("There is no statistically significant difference in the average price per room between customers who canceled and those who did not.\n")
    cat("The p-value is greater than 0.05, so we fail to reject the null hypothesis that the difference in means is zero.\n")
  }
}

# Call the function with the t-test result
interpret_ttestavg_price_per_room(ttest_avg_price_per_room)
## 
##  Welch Two Sample t-test
## 
## data:  df_canceled$avg_price_per_room and df_not_canceled$avg_price_per_room
## t = 28, df = 25929, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   9.92 11.39
## sample estimates:
## mean of x mean of y 
##     110.6      99.9 
## 
## Conclusion:
## There is a statistically significant difference in the average price per room between customers who canceled and those who did not.
## The p-value is less than 0.05, meaning the observed difference in means is unlikely to have occurred by chance.

Conclusion

  1. This analysis suggests that a guest’s previous cancellation history could be a reliable predictor of their future booking behavior. Hotels could use this information to refine their overbooking strategies, potentially allocating more flexible cancellation policies to first-time guests or those without a history of cancellations. Conversely, they might implement stricter policies or require deposits from guests with a history of cancellations to mitigate potential losses.

2.Cancellation rate for bookings without special requests: 8545 / (8545 + 11232) ≈ 43.2%. Cancellation rate for bookings with special requests: 3340 / (3340 + 13158) ≈ 20.2%. There’s a clear difference in cancellation rates, with bookings having special requests showing a significantly lower cancellation rate.